{"nbformat":4,"nbformat_minor":0,"metadata":{"anaconda-cloud":{},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"},"colab":{"name":"Tutorial II.ipynb","provenance":[],"collapsed_sections":["YLKYlLJRgW91","EqADCB3D9Rxo","DInAkgl1gW93","HJaPvpOFgW98","XpJ1CBW2gW99","UtXcbsUJgW-I","gpxvwQgKgW-L"],"toc_visible":true}},"cells":[{"cell_type":"markdown","metadata":{"id":"berwLM3NgW9v","colab_type":"text"},"source":["# Tutorial II: Optimization in TensorFlow & NN introduction"]},{"cell_type":"markdown","metadata":{"id":"E3q4heOGgW9x","colab_type":"text"},"source":["

\n","Bern Winter School on Machine Learning, 27-31 January 2020<br>
\n","Prepared by Mykhailo Vladymyrov.\n","

\n","\n","This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License."]},{"cell_type":"markdown","metadata":{"id":"YLKYlLJRgW91","colab_type":"text"},"source":["## 1. Load necessary libraries"]},{"cell_type":"code","metadata":{"id":"i4bwmIrggalE","colab_type":"code","colab":{}},"source":["# if using google colab\n","%tensorflow_version 2.x"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"ZjwECu5bgW92","colab_type":"code","colab":{}},"source":["import sys\n","\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import IPython.display as ipyd\n","import tensorflow.compat.v1 as tf\n","tf.disable_v2_behavior()\n","\n","# We'll tell matplotlib to inline any drawn figures like so:\n","%matplotlib inline\n","plt.style.use('ggplot')\n","\n","from IPython.core.display import HTML\n","HTML(\"\"\"\"\"\")"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"EqADCB3D9Rxo","colab_type":"text"},"source":["### Download libraries"]},{"cell_type":"code","metadata":{"id":"yMxFBHhV9Rxq","colab_type":"code","colab":{}},"source":["p = tf.keras.utils.get_file('./material.tgz', 'https://scits-training.unibe.ch/data/tut_files/t123.tgz')\n","!mv {p} .\n","!tar -xvzf material.tgz > /dev/null 2>&1"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"cInCJilaOC5A","colab_type":"code","colab":{}},"source":["from utils import gr_disp"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"DInAkgl1gW93","colab_type":"text"},"source":["## 2. Linear fit"]},{"cell_type":"markdown","metadata":{"id":"4ELO5INFgW94","colab_type":"text"},"source":["Here we will solve an optimization problem to perform linear regression. First we will generate a training set of 80 data points and a test set of 20, lying on a line with a random offset $$y = a_0 x + b_0 + \\delta,$$ where $\\delta$ is a random variable sampled uniformly from the interval $[-s_0, s_0]$."]},{"cell_type":"code","metadata":{"id":"-h6vmx-7gW94","colab_type":"code","colab":{}},"source":["a0 = 3 # 3\n","b0 = 5 # 5\n","s0 = 2 # 1\n","\n","# all samples\n","x_all = np.linspace(0, 10, 100) # 100 points\n","n_all = x_all.shape[0]\n","d_all = np.random.uniform(-s0, s0, size=n_all)\n","y_all = np.asarray([a0*x + b0 + d for x, d in zip(x_all, d_all)])\n","\n","# randomize order and get 80% for training\n","idx = np.random.permutation(n_all)\n","n_train = n_all * 80 // 100\n","\n","idx_train = idx[0:n_train]\n","idx_test = idx[n_train:]\n","\n","x_train = x_all[idx_train]\n","y_train = y_all[idx_train]\n","\n","x_test = x_all[idx_test]\n","y_test = y_all[idx_test]\n","\n","plt.plot(x_train, y_train, \"o\", x_test, y_test, \"b^\")\n","plt.legend(('training points', 'test points'), loc='upper left')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"sERniBfcgW96","colab_type":"text"},"source":["We will then define the loss function as the mean of the squared residuals (distances from the line along $y$) over the points.\n","\n","We will use [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent): on each iteration we use only a fraction (`mini_batch_size`) of the training set. In many cases the training set is huge and cannot be fed in full on each iteration anyway. Also, it can sometimes help the optimizer to properly explore the manifold."]},
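{"cell_type":"markdown","metadata":{},"source":["As a tiny standalone illustration (separate from the fit below), the next cell shows how a shuffled index array can be split into mini-batches with numpy; the number of samples and the batch size here are made-up example values."]},{"cell_type":"code","metadata":{},"source":["# Illustration only: split shuffled indices into mini-batches (made-up sizes).\n","toy_n = 12                              # pretend we have 12 training samples\n","toy_batch_size = 4                      # pretend mini-batch size\n","toy_idx = np.random.permutation(toy_n)  # new random order, as done once per epoch\n","for mb in range(toy_n // toy_batch_size):\n","    print('mini-batch', mb, ':', toy_idx[toy_batch_size*mb:toy_batch_size*(mb+1)])"],"execution_count":0,"outputs":[]},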
{"cell_type":"code","metadata":{"id":"OC0_ZfsXgW97","colab_type":"code","colab":{}},"source":["tf.reset_default_graph()\n","\n","# here we have 2 trainable parameters, a and b\n","a = tf.get_variable(name='a', dtype=tf.float32, initializer=tf.random_normal(()))\n","b = tf.get_variable(name='b', dtype=tf.float32, initializer=tf.random_normal(()))\n","x = tf.placeholder(name='x', dtype=tf.float32, shape=(None))\n","y = tf.placeholder(name='y', dtype=tf.float32, shape=(None))\n","\n","residual = y - (x*a + b)\n","residual2 = tf.square(residual)\n","loss = tf.reduce_mean(residual2)\n","\n","optimizer = tf.train.GradientDescentOptimizer(0.002).minimize(loss) # 0.002\n","mini_batch_size = 10 # 10\n","\n","l_sv_train = []\n","l_sv_test = []\n","with tf.Session() as sess:\n","    # initialize all the variables\n","    sess.run(tf.global_variables_initializer())\n","\n","    # iterate training for 1001 epochs\n","    for epoch in range(1001):\n","        # shuffle the data and perform stochastic gradient descent by running over all mini-batches\n","        idx = np.random.permutation(n_train)\n","        for mb in range(n_train//mini_batch_size):\n","            sub_idx = idx[mini_batch_size*mb:mini_batch_size*(mb+1)]\n","            sess.run(optimizer, feed_dict={x:x_train[sub_idx], y:y_train[sub_idx]})\n","\n","        # evaluate and save the test and training loss, to be plotted in the end\n","        l_val_test = sess.run(loss, feed_dict={x:x_test, y:y_test})\n","        l_val_train = sess.run(loss, feed_dict={x:x_train, y:y_train})\n","        if epoch%100==0:\n","            l_val, a_val, b_val = sess.run([loss, a, b], feed_dict={x:x_train, y:y_train})\n","            print(epoch, l_val, a_val, b_val)\n","\n","        l_sv_train.append(l_val_train)\n","        l_sv_test.append(l_val_test)\n","\n","end_fit_x = [x_all[0], x_all[-1]]\n","end_fit_y = [a_val*x+b_val for x in end_fit_x]\n","true_fn_y = [a0*x+b0 for x in end_fit_x]\n","fig, axs = plt.subplots(2, 1, figsize=(10,8))\n","axs[0].plot(x_train, y_train, 'ro', x_test, y_test, 'b^', end_fit_x, end_fit_y, 'g')\n","axs[0].legend(('training points', 'test points', 'final fit'), loc='upper left')\n","ep_arr = np.arange(len(l_sv_train))\n","axs[1].semilogy(ep_arr, l_sv_train, 'r', ep_arr, l_sv_test, 'b')\n","axs[1].legend(('training loss', 'test loss'), loc='upper right')\n"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HJaPvpOFgW98","colab_type":"text"},"source":["## 3. Exercise 1"]},{"cell_type":"markdown","metadata":{"id":"CfFwDCNEgW99","colab_type":"text"},"source":["Play with the true function parameters `a0, b0, s0` and the `mini_batch_size` value, and check how they affect the convergence.\n","\n","1. How does changing `s0` affect convergence?\n","2. When should one stop training to prevent overfitting?"]},{"cell_type":"markdown","metadata":{"id":"XpJ1CBW2gW99","colab_type":"text"},"source":["## 4. A few practical details"]},
{"cell_type":"markdown","metadata":{"id":"zJUUnwWcgW9-","colab_type":"text"},"source":["As we just saw, training is done iteratively, by adjusting the model parameters.\n","\n","We run the optimization over the whole training dataset several times. Going through the entire dataset once is referred to as an 'epoch'.\n","\n","Training is therefore usually organized in two loops. In the outer loop we iterate over the epochs. Within each epoch we split the dataset into small chunks, 'mini-batches', and perform an optimization step on each of them.\n","\n","It is important that the data does not enter the training pipeline in the same order every epoch. So the overall scheme looks like this (pseudocode):\n","\n","\n","```\n","x,y = get_training_data()\n","for epoch in range(number_epochs):\n","    x_shfl,y_shfl = shuffle(x,y)\n","\n","    for mb_idx in range(number_minibatches_in_batch):\n","        x_mb,y_mb = get_minibatch(x_shfl,y_shfl, mb_idx)\n","\n","        optimize_on(data=x_mb, labels=y_mb)\n","```"]},{"cell_type":"markdown","metadata":{"id":"riyTt-xEgW9-","colab_type":"text"},"source":["Shuffling can easily be done using permuted indices."]},{"cell_type":"code","metadata":{"id":"_rd-BxRugW9_","colab_type":"code","colab":{}},"source":["# some array\n","arr = np.array([110,111,112,113,114,115,116])\n","\n","# we can get a sub-array for a set of indices, e.g.:\n","idx_1_3 = [1,3]\n","sub_arr_1_3 = arr[idx_1_3]\n","print(arr,'[',idx_1_3,']','->', sub_arr_1_3)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"caAztS5MgW-B","colab_type":"code","colab":{}},"source":["ordered_idx = np.arange(7)\n","permuted_idx = np.random.permutation(7)\n","print(ordered_idx)\n","print(permuted_idx)\n","\n","permuted_arr = arr[permuted_idx]\n","print(arr,'[',permuted_idx,']','->', permuted_arr)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"zrmUh6GOgW-C","colab_type":"text"},"source":["Some additional `np` operations used in this tutorial:"]},{"cell_type":"code","metadata":{"id":"agVVtXbUgW-D","colab_type":"code","colab":{}},"source":["# index of the element with the highest value\n","np.argmax(permuted_arr)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"GXLIXly4gW-E","colab_type":"code","colab":{}},"source":["arr2d = np.array([[0,1],[2,3]])\n","print(arr2d)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"-9q8RkuUgW-G","colab_type":"code","colab":{}},"source":["# flatten\n","arr_flat = arr2d.flatten()\n","# reshape\n","arr_4 = arr2d.reshape((4))\n","arr_4_1 = arr2d.reshape((4,1))\n","\n","print(arr_flat)\n","print(arr_4)\n","print(arr_4_1)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"UtXcbsUJgW-I","colab_type":"text"},"source":["## 5. Building blocks of a neural network"]},{"cell_type":"markdown","metadata":{"id":"iZ1JOmpCgW-I","colab_type":"text"},"source":["A neural network consists of layers of neurons. Each neuron performs two operations.\n","\n","*(figure: schematic of a single neuron)*\n","\n","1. Calculate the linear transformation of the input vector $\\bar x$: \n","$$z = \\bar W \\cdot \\bar x + b = \\sum {W_i x_i} + b,$$ where $\\bar W$ is the vector of weights and $b$ is the bias.\n","2. Perform a nonlinear transformation of the result using an activation function $f$: $$y = f(z)$$ Here we will use the rectified linear unit (ReLU) activation.\n","\n","In a fully connected neural network each layer is a set of $N$ neurons, each performing a different transformation of the same layer inputs $\\bar x = [x_i]$, producing the output vector $\\bar y = [y_j]_{j=1..N}$: $$y_j = f(\\bar W_j \\cdot \\bar x + b_j)$$\n","\n","Since the output of each layer forms the input of the next layer, one can write for layer $l$: $$x^l_j = f(\\bar W^l_j \\cdot \\bar x^{l-1} + b^l_j)$$ where $\\bar x^0$ is the network's input vector.\n","\n","*(figure: a fully connected network of several layers)*\n","\n","The next cell sketches this forward pass for a single layer directly in numpy; after that, to simplify building the network, we'll define a helper function that creates a layer of neurons with a given number of outputs."]},
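{"cell_type":"markdown","metadata":{},"source":["A minimal numpy sketch of the layer formula above, assuming made-up sizes (3 inputs, 2 neurons) and made-up weight values; it is only meant to make $y_j = f(\\bar W_j \\cdot \\bar x + b_j)$ concrete and is not used later."]},{"cell_type":"code","metadata":{},"source":["# Illustration only: forward pass of one fully connected layer with ReLU.\n","def relu(z):\n","    return np.maximum(z, 0.0)\n","\n","x_in = np.array([1.0, -2.0, 0.5])   # made-up input vector (3 features)\n","W_l = np.array([[ 0.2, -0.5, 1.0],  # made-up weights: 2 neurons x 3 inputs\n","                [-0.3,  0.8, 0.1]])\n","b_l = np.array([0.1, -0.2])         # made-up biases, one per neuron\n","\n","z = W_l @ x_in + b_l                # linear part: W.x + b\n","y_out = relu(z)                     # nonlinearity\n","print('z =', z, '  y =', y_out)"],"execution_count":0,"outputs":[]},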
{"cell_type":"code","metadata":{"code_folding":[],"id":"iaU1t0WngW-J","colab_type":"code","colab":{}},"source":["def fully_connected_layer(x, n_output, name=None, activation=None):\n","    \"\"\"Fully connected layer.\n","\n","    Parameters\n","    ----------\n","    x : tf.Tensor\n","        Input tensor to connect\n","    n_output : int\n","        Number of output neurons\n","    name : None, optional\n","        TF scope to apply\n","    activation : None, optional\n","        Non-linear activation function\n","\n","    Returns\n","    -------\n","    h, W : tf.Tensor, tf.Tensor\n","        Output of the fully connected layer and the weight matrix\n","    \"\"\"\n","    # flatten higher-rank inputs to shape (batch, features)\n","    if len(x.get_shape()) != 2:\n","        x = tf.layers.flatten(x)\n","\n","    n_input = x.get_shape().as_list()[1]\n","\n","    with tf.variable_scope(name or \"fc\", reuse=None):\n","        W = tf.get_variable(\n","            name='W',\n","            shape=[n_input, n_output],\n","            dtype=tf.float32,\n","            initializer=tf.initializers.glorot_uniform())\n","\n","        b = tf.get_variable(\n","            name='b',\n","            shape=[n_output],\n","            dtype=tf.float32,\n","            initializer=tf.constant_initializer(0.0))\n","\n","        h = tf.nn.bias_add(\n","            name='h',\n","            value=tf.matmul(x, W),\n","            bias=b)\n","\n","        if activation:\n","            h = activation(h)\n","\n","        return h, W"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"J7gzOyONgW-K","colab_type":"text"},"source":["In the case of classification, in the last layer we use the *softmax* transformation as the non-linearity: $$y_i = \\sigma(\\bar z)_i = \\frac{ e^{z_i}}{\\sum_j e^{z_j}}$$"]},{"cell_type":"markdown","metadata":{"id":"EyRaNDlVgW-L","colab_type":"text"},"source":["This will correspond to the one-hot labels that we use.\n","Finally, we will use the cross-entropy between the output $y$ and the ground truth (GT) $y_{GT}$ as the loss function: $$H(y, y_{GT}) = - \\sum_i y_{GT, i} \\log(y_{i})$$\n"]},
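{"cell_type":"markdown","metadata":{},"source":["To make these two formulas concrete, here is a minimal numpy sketch that evaluates the softmax and the cross-entropy on a made-up logit vector and a made-up one-hot ground-truth label; it is purely illustrative and separate from the network built next."]},{"cell_type":"code","metadata":{},"source":["# Illustration only: softmax and cross-entropy on made-up values.\n","z_logits = np.array([2.0, 1.0, 0.1])                  # made-up outputs of the last linear layer\n","y_soft = np.exp(z_logits) / np.sum(np.exp(z_logits))  # softmax: positive, sums to 1\n","\n","y_gt = np.array([0.0, 1.0, 0.0])                      # made-up one-hot ground truth (second class)\n","H = -np.sum(y_gt * np.log(y_soft))                    # cross-entropy H(y, y_GT)\n","\n","print('softmax:', y_soft, ' sum =', np.sum(y_soft))\n","print('cross-entropy:', H)"],"execution_count":0,"outputs":[]},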
{"cell_type":"markdown","metadata":{"id":"gpxvwQgKgW-L","colab_type":"text"},"source":["## 6. Building a neural network"]},{"cell_type":"code","metadata":{"id":"Ba_cMJuJgW-M","colab_type":"code","colab":{}},"source":["n_input = 10\n","n_output = 2"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"IqlSKoY8gW-O","colab_type":"code","colab":{}},"source":["g = tf.Graph()\n","with g.as_default():\n","    X = tf.placeholder(name='X', dtype=tf.float32, shape=[None, n_input])\n","    Y = tf.placeholder(name='Y', dtype=tf.float32, shape=[None, n_output])\n","\n","    # layer 1: 10 inputs -> 4, sigmoid activation; layer 2: 4 -> 2, linear\n","    L1, W1 = fully_connected_layer(X, 4, 'L1', activation=tf.sigmoid)\n","    L2, W2 = fully_connected_layer(L1, 2, 'L2', activation=None)\n","    # softmax over the logits L2 gives class probabilities\n","    Y_onehot = tf.nn.softmax(L2, name='Logits')\n","\n","    # prediction: one-hot probabilities -> integer class index\n","    Y_pred = tf.argmax(Y_onehot, axis=1, name='YPred')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"xwgSoCjQgW-P","colab_type":"code","colab":{}},"source":["print(X)\n","print(Y)\n","# print(Y_pred)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"cINev65WgW-R","colab_type":"code","colab":{}},"source":["gr_disp.show(g.as_graph_def())"],"execution_count":0,"outputs":[]}]}